
feat: make embedding model configurable for non-English search#553

Open
oeniao wants to merge 1 commit into MemPalace:develop from oeniao:feat/configurable-embedding-model

Conversation


@oeniao oeniao commented Apr 10, 2026

Problem

The default ChromaDB embedding model (all-MiniLM-L6-v2) is English-only. Users writing in Chinese, Japanese, Korean, or other non-Latin languages get poor search results — match scores go negative and retrieved memories are irrelevant.

Solution

Add an embedding_model config option that lets users opt into any HuggingFace sentence-transformers model without patching installed files.

Configuration (priority order):

  1. Env var: MEMPALACE_EMBEDDING_MODEL=paraphrase-multilingual-MiniLM-L12-v2
  2. Config file: ~/.mempalace/config.json → {"embedding_model": "paraphrase-multilingual-MiniLM-L12-v2"}
  3. Default: null → uses ChromaDB built-in (unchanged behaviour)

Example for Chinese/Japanese/Korean users:

```json
{
  "embedding_model": "paraphrase-multilingual-MiniLM-L12-v2"
}
```

Changes

  • config.py: added DEFAULT_EMBEDDING_MODEL = None and an embedding_model property with env var support (MEMPALACE_EMBEDDING_MODEL)
  • palace.py: added a _get_embedding_function() helper that reads config and lazily imports sentence-transformers only when needed (both sketched below)
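
For concreteness, a rough sketch of the shape of these changes — hypothetical code, not a diff; beyond `DEFAULT_EMBEDDING_MODEL`, `embedding_model`, and `_get_embedding_function()`, the names and structure are illustrative:

```python
import json
import os

DEFAULT_EMBEDDING_MODEL = None
CONFIG_PATH = os.path.expanduser("~/.mempalace/config.json")

def get_embedding_model():
    # 1. Env var wins.
    env_val = os.environ.get("MEMPALACE_EMBEDDING_MODEL") or None
    if env_val:
        return env_val
    # 2. Fall back to the config file.
    try:
        with open(CONFIG_PATH) as f:
            return json.load(f).get("embedding_model") or DEFAULT_EMBEDDING_MODEL
    except (FileNotFoundError, json.JSONDecodeError):
        # 3. Default: None -> ChromaDB's built-in embedding function.
        return DEFAULT_EMBEDDING_MODEL

def _get_embedding_function():
    model = get_embedding_model()
    if model is None:
        return None  # caller passes nothing; ChromaDB uses its default
    try:
        # Lazy import: sentence-transformers is only needed on this path.
        from chromadb.utils.embedding_functions import (
            SentenceTransformerEmbeddingFunction,
        )
    except ImportError as exc:
        raise ImportError(
            "embedding_model is set but sentence-transformers is not "
            "installed; run `pip install sentence-transformers`"
        ) from exc
    return SentenceTransformerEmbeddingFunction(model_name=model)
```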

Backward compatibility

  • Default is null → behaviour identical to before
  • sentence-transformers is not a new required dependency — only needed when embedding_model is set
  • Existing palaces continue to work without any migration

Motivation

Discovered this while setting up MemPalace for daily use in Chinese. Had to patch the installed package directly — this PR makes it a proper first-class config option.

Add an `embedding_model` config option that lets users switch from the
default ChromaDB built-in (all-MiniLM-L6-v2, English-only) to any
HuggingFace sentence-transformers model.

This makes non-English search viable without patching installed files.
For example, setting `paraphrase-multilingual-MiniLM-L12-v2` improves
Chinese/Japanese/Korean recall significantly.

Configuration (priority order):
1. Env var:  MEMPALACE_EMBEDDING_MODEL=paraphrase-multilingual-MiniLM-L12-v2
2. Config:   ~/.mempalace/config.json → {"embedding_model": "..."}
3. Default:  null (uses ChromaDB built-in, no extra dependencies)

The `sentence-transformers` package is only required when a custom model
is configured; the default path is unchanged and fully backward-compatible.

@web3guru888 web3guru888 left a comment


This fills a real gap — the English-only ONNX default is a genuine pain point for non-Latin users, and patching installed files is not a reasonable workaround.

Design is sound. The layered config priority (env var → file → default) is consistent with how the rest of the codebase handles config. The lazy sentence-transformers import is exactly right: no new hard dependency, clean ImportError message telling the user exactly what to do.

One correctness concern worth flagging: _get_embedding_function() creates a new MempalaceConfig() instance on every call (and get_collection() calls it on every operation). That's minor overhead but not a real problem. The larger issue is that get_collection() passes the embedding function to both get_collection and create_collection — but ChromaDB ties the embedding function to the collection at creation time. If a palace was created with the default model and is later opened with MEMPALACE_EMBEDDING_MODEL set (or vice versa), ChromaDB will silently accept the mismatch and your embeddings become incoherent (new writes land in a different region of embedding space from old ones).

The guard against this would be a warning or error when the configured model doesn't match what the collection was created with — or at minimum a note in the docs that changing embedding_model on an existing palace requires re-mining.
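
A minimal sketch of such a guard, assuming the model name gets stamped into the collection's metadata at creation time — which this PR does not do yet, so `check_embedding_model` and the metadata key are hypothetical:

```python
import warnings

def check_embedding_model(collection, configured_model):
    # Compare against the model recorded when the collection was created.
    created_with = (collection.metadata or {}).get("embedding_model")
    if created_with is not None and created_with != configured_model:
        warnings.warn(
            f"Palace was created with embedding model {created_with!r} but "
            f"{configured_model!r} is now configured; old and new vectors "
            "will live in different embedding spaces until you re-mine."
        )
```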

Minor: no tests for the new config property or the _get_embedding_function() path. Even a couple of unit tests with a mock model name would catch regressions.

The core feature is right and the backward-compatibility story is solid (null default = unchanged behaviour). The model-mismatch case is the one thing worth addressing before merge — even just a doc note would be better than nothing.


[MemPalace-AGI integration — production stats at https://milla-jovovich.github.io/mempalace/integrations/mempalace-agi/]

@web3guru888

This is the right fix for #516 — and the design is clean.

The layered config priority (env var → config.json → default) is the correct approach. Env var support means CI/CD and containerized setups can override without touching config files, and the config.json path keeps it user-friendly for interactive use.

A few notes from running non-English content through our integration:

Backward-compat risk: If a user sets embedding_model on an existing palace, the collection's embedding space changes but the stored vectors don't. ChromaDB will silently accept the new function and produce wrong similarity scores — or fail with a dimension mismatch if the model has different output dims. Worth adding a note in the config docstring (and ideally a startup check) that this setting must be decided before first use. A warning like ⚠ Existing palace detected — ensure embedding_model matches the model used at creation would save people pain.

Empty string edge case: config.py:L164's `return env_val or None` correctly converts an empty string to None, though the `if env_val:` check above it already treats "" as falsy and falls through to the config file check, so the extra `or None` never fires. The logic is fine, just slightly redundant — `env_val = os.environ.get(...) or None` at the top would be clearer.
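
i.e. something like this (illustrative, not the actual config.py code):

```python
import os

def _env_embedding_model():
    # Normalizing at the top makes "" behave like unset.
    return os.environ.get("MEMPALACE_EMBEDDING_MODEL") or None
```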

Missing get_collection call sites: The _get_embedding_function() helper is only wired into get_collection(). But there are a few call sites in the codebase that construct their own client directly (search, exporter). Worth confirming those all go through get_collection so the embedding function is consistently applied.

The lazy import for sentence-transformers with the clear error message is exactly the right pattern — no silent dependency pollution for English-only users.

Overall this unblocks a real pain point for non-English users. LGTM with the backward-compat caveat addressed.

@mvalentsev
Contributor

#442 covers the same feature (configurable embedding model via env var / config.json) with a broader scope -- it patches all 7 ChromaDB consumer modules, adds mismatch detection, and includes a test suite. Both PRs touch config.py and palace.py for the same purpose so they'll conflict.

@web3guru888

Good callout @mvalentsev — yes, both PRs target the same config surface and will conflict.

The scopes are genuinely different though: this PR (#553) is minimal — configurable model for the primary search + mine path. #442 goes wider (all 7 ChromaDB consumers + mismatch detection + a test suite). Both are valid approaches.

Whether to merge #442 first, close this one, or merge this as a stopgap probably depends on #442's merge timeline and maintainer preference. If #442 is close to ready, that's likely the better merge path given the broader coverage. If #442 is stalled, this is a fast unblock for non-English users.

Either way, @oeniao it would be worth opening a quick coordination comment on #442 to flag the conflict — or offering to pull in mvalentsev's mismatch detection test idea into this PR if #442 ends up waiting.


[MemPalace-AGI integration — 215 tests, 710 KG entities]

@NickShtefan

Hey @oeniao — thanks for tackling this, non-English search is definitely a pain point.

Just a heads-up: #442 addresses the same problem with broader coverage — all 7 ChromaDB consumer modules are patched (not just palace.py), plus it includes embedding model mismatch detection, a re-mine command for model switching, configurable chunk size, device support (MPS/CUDA), and a test suite. It was just rebased on latest main today.

One note on the model choice: paraphrase-multilingual-MiniLM-L12-v2 has a 128-token context window which truncates longer chunks. I tested it and switched to intfloat/multilingual-e5-base (512 tokens) — much better for real-world content.
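
If you want to verify this yourself before picking a model, a quick REPL check (requires sentence-transformers; not part of the PR):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
print(model.max_seq_length)  # 128 -- anything longer gets truncated
```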

Since both PRs touch the same config surface and will conflict, it probably makes sense to coordinate. Happy to discuss!

@bensig bensig changed the base branch from main to develop April 11, 2026 22:21
mvalentsev added a commit to mvalentsev/mempalace that referenced this pull request Apr 12, 2026
ORT_DISABLE_COREML is not a recognized ONNX Runtime environment
variable. ONNX Runtime does not expose a global env var to disable
individual execution providers -- providers are selected per session
via the providers argument to InferenceSession. Setting it had zero
effect, so the CoreMLExecutionProvider was still loaded on Apple
Silicon and the segfault from MemPalace#74 was not actually mitigated.

Replace the misleading setdefault call with a comment that records
the history and points at the real fix path. The proper CoreML
workaround requires passing preferred_providers at the ChromaDB
embedding function level, which is a larger change that belongs in
its own PR once the configurable embedding model work in MemPalace#442 /
MemPalace#553 lands.

Closes MemPalace#397
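
For reference, "per session via the providers argument" in ONNX Runtime looks like this — a minimal sketch with an illustrative model path, not the ChromaDB-level fix the commit describes:

```python
import onnxruntime as ort

# Providers are chosen per InferenceSession; listing only the CPU provider
# keeps CoreMLExecutionProvider from loading on Apple Silicon.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CPUExecutionProvider"],
)
```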
@igorls igorls added the area/search (Search and retrieval) and enhancement (New feature or request) labels Apr 14, 2026
